Summary

Bellabeat is a small high-tech manufacturer of health-focused smart products, founded in 2013. It collects data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits.

This report analyzes smart device usage data to gain insight into how consumers use non-Bellabeat smart devices and to reveal more opportunities for growth.

We will be focusing on Bellabeat’s app. Their app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

Prepare Phase

The dataset used for this report is FitBit Fitness Tracker Data, made available by Mobius on Kaggle.

The datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

The dataset is licensed under CC0: Public Domain. The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Understanding the Dataset

library(tidyverse)
library(readxl)
library(readr)
library(dplyr)
library(skimr)
library(janitor)
library(magrittr)
library(ggplot2)
library(lubridate)
library(ggpubr)
library(gapminder)
library(gganimate)
library(ggthemes) 
library(transformr) 
library(gifski) 

We will first take a quick look at the number of rows, columns and number of unique participants.

#capturing all the datasets
files <- list.files(
  path = "D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset",
  pattern  = "\\.csv$", #regex: only file names ending in .csv
  full.names = TRUE
)

#loop to store num of rows & cols & unique IDs
numrows <- c()
numcols <- c()
numunique <- c()

#seq_along() covers however many files were found
for (i in seq_along(files)) {
  xzy <- read.csv(files[i])
  numrows <- append(numrows, nrow(xzy))
  numcols <- append(numcols, ncol(xzy))
  numunique <- append(numunique, n_unique(xzy$Id))
}

#dataset names without the path
datasetlist <- basename(files)

#dataset details
excelrowscols <- data.frame(dataset = datasetlist, rows = numrows, 
                            cols = numcols, numofparticipants = numunique)

excelrowscols[order(excelrowscols$rows),]
##                               dataset    rows cols numofparticipants
## 18                sleepDay_merged.csv     413    5                24
## 2            dailyActivity_merged.csv     940   15                33
## 3            dailyCalories_merged.csv     940    3                33
## 4         dailyIntensities_merged.csv     940   10                33
## 5               dailySteps_merged.csv     940    3                33
## 11      minuteCaloriesWide_merged.csv   21645   62                33
## 13   minuteIntensitiesWide_merged.csv   21645   62                33
## 17         minuteStepsWide_merged.csv   21645   62                33
## 7           hourlyCalories_merged.csv   22099    3                33
## 8        hourlyIntensities_merged.csv   22099    4                33
## 9              hourlySteps_merged.csv   22099    3                33
## 15             minuteSleep_merged.csv  188521    4                24
## 1         combined dataset merged.csv 1048575    3                 7
## 10    minuteCaloriesNarrow_merged.csv 1325580    3                33
## 12 minuteIntensitiesNarrow_merged.csv 1325580    3                33
## 14        minuteMETsNarrow_merged.csv 1325580    3                33
## 16       minuteStepsNarrow_merged.csv 1325580    3                33
## 6        heartrate_seconds_merged.csv 2483658    3                14

Highlights about the dataset:

Datasets are mainly broken down into:

  1. dailyActivity_merged.csv - data on the daily activities of 33 users. The data tracked includes steps, distance, intensities and calories, recorded daily.
    • dailyCalories_merged.csv, dailyIntensities_merged.csv and dailySteps_merged.csv are all subsets of dailyActivity_merged.csv.
    • hourlyCalories_merged.csv, hourlyIntensities_merged.csv and hourlySteps_merged.csv are also subsets of dailyActivity_merged.csv, but the data are tracked every hour of every day.
    • minuteCaloriesWide_merged.csv, minuteIntensitiesWide_merged.csv and minuteStepsWide_merged.csv are also subsets of dailyActivity_merged.csv, but the data are tracked every minute and presented in wide format.
    • minuteCaloriesNarrow_merged.csv, minuteIntensitiesNarrow_merged.csv and minuteStepsNarrow_merged.csv are also subsets of dailyActivity_merged.csv, but the data are tracked every minute and presented in long format.
  2. minuteMETsNarrow_merged.csv is data on how much energy a person expends based on the amount of oxygen consumed by the body, with 33 unique users identified.
  3. sleepDay_merged.csv is data on users’ daily sleep logs. The logs include the number of sleep records per day, total minutes asleep and total time in bed. Only 24 unique users identified.
  4. heartrate_seconds_merged.csv is data on users’ heart rate, recorded at intervals of 5 to 20 seconds, with only 14 unique users identified.
  5. weightLogInfo_merged.csv is data on users’ weight and BMI. Days tracked are inconsistent among users and only 8 unique users identified.
  6. minuteSleep_merged.csv has no further details provided on its column variables, so we are unable to extract further information about this dataset.

Statistical inference of the dataset

These datasets are an example of non-probability samples, as they were generated by respondents to a survey distributed via Amazon Mechanical Turk. In addition, they have small sample sizes of between 8 and 33 participants, varying across datasets. Hence we would like to highlight that there is a risk of sampling bias and that the sampled units may not be representative of the larger target population of interest.

Process Phase

From earlier, we see that some datasets have more than 1,048,576 rows, which exceeds the maximum number of rows Excel can load. Hence we will be using RStudio to process, clean and share our data.
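As a quick check of the row limit mentioned above, the sketch below flags the files that would not fit into a single Excel worksheet. The row counts are copied from the dataset summary earlier in this report; the variable names are illustrative only.

```r
# Row counts copied from the dataset summary table
rowcounts <- c(
  "sleepDay_merged.csv" = 413,
  "dailyActivity_merged.csv" = 940,
  "minuteMETsNarrow_merged.csv" = 1325580,
  "heartrate_seconds_merged.csv" = 2483658
)

# Maximum number of rows in one Excel worksheet
excel_max_rows <- 1048576

# Names of the files that exceed the limit
too_big <- names(rowcounts)[rowcounts > excel_max_rows]
too_big
```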

We will be using the following datasets for our analysis:

  1. dailyActivity_merged.csv
  2. sleepDay_merged.csv
  3. minuteMETsNarrow_merged.csv

We chose not to use the calories, intensities and steps datasets as they are subsets of dailyActivity_merged.csv. weightLogInfo_merged.csv and minuteSleep_merged.csv are not suitable for analysis due to their small sample size, inconsistent data and incomplete metadata.

Deeper look at our dataset

Now that we understand more about our data structures, we will process the data to look for any errors and inconsistencies.

#read files
daily_activities <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\dailyActivity_merged.csv')
sleep_log <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\sleepDay_merged.csv')
minmets <- read_csv('D:\\User\\Courses\\Google Data Analytics coursera\\case study dataset\\minuteMETsNarrow_merged.csv')

#check column types
str(daily_activities)
## spec_tbl_df [940 x 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                      : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr [1:940] "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : num [1:940] 13162 10735 10460 9762 12669 ...
##  $ TotalDistance           : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num [1:940] 8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num [1:940] 1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num [1:940] 6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : num [1:940] 728 776 1218 726 773 ...
##  $ Calories                : num [1:940] 1985 1797 1776 1745 1863 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityDate = col_character(),
##   ..   TotalSteps = col_double(),
##   ..   TotalDistance = col_double(),
##   ..   TrackerDistance = col_double(),
##   ..   LoggedActivitiesDistance = col_double(),
##   ..   VeryActiveDistance = col_double(),
##   ..   ModeratelyActiveDistance = col_double(),
##   ..   LightActiveDistance = col_double(),
##   ..   SedentaryActiveDistance = col_double(),
##   ..   VeryActiveMinutes = col_double(),
##   ..   FairlyActiveMinutes = col_double(),
##   ..   LightlyActiveMinutes = col_double(),
##   ..   SedentaryMinutes = col_double(),
##   ..   Calories = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(sleep_log)
## spec_tbl_df [413 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id                : num [1:413] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr [1:413] "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : num [1:413] 1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: num [1:413] 327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : num [1:413] 346 407 442 367 712 320 377 364 384 449 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   SleepDay = col_character(),
##   ..   TotalSleepRecords = col_double(),
##   ..   TotalMinutesAsleep = col_double(),
##   ..   TotalTimeInBed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(minmets)
## spec_tbl_df [1,325,580 x 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Id            : num [1:1325580] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityMinute: chr [1:1325580] "4/12/2016 12:00:00 AM" "4/12/2016 12:01:00 AM" "4/12/2016 12:02:00 AM" "4/12/2016 12:03:00 AM" ...
##  $ METs          : num [1:1325580] 10 10 10 10 10 12 12 12 12 12 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Id = col_double(),
##   ..   ActivityMinute = col_character(),
##   ..   METs = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
#check for duplicates
sum(duplicated(daily_activities))
## [1] 0
sum(duplicated(sleep_log))
## [1] 3
sum(duplicated(minmets))
## [1] 0

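From the above, we have noticed that:

  1. The date columns (ActivityDate, SleepDay and ActivityMinute) are stored as character strings rather than dates.
  2. sleep_log contains 3 duplicated rows.

We will therefore correct the date types and remove the duplicates.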

#remove duplicates and NA rows
daily_activities <- daily_activities %>% distinct() %>% drop_na()
sleep_log <- sleep_log %>% distinct() %>% drop_na()
minmets <- minmets %>% distinct() %>% drop_na()

#check
sum(duplicated(daily_activities))
## [1] 0
sum(duplicated(sleep_log))
## [1] 0
sum(duplicated(minmets))
## [1] 0
#correct the date type and rename column names for merging
daily_activities <- daily_activities %>%
  rename(date = ActivityDate) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

#format as date without time as all the time in sleep_log is at 12am
sleep_log <- sleep_log %>%
  rename(date = SleepDay) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

minmets <- minmets %>%
  mutate(ActivityMinute = as.POSIXct(ActivityMinute, format = "%m/%d/%Y %I:%M:%S %p"))

#check
colnames(daily_activities)
##  [1] "Id"                       "date"                    
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
class(daily_activities$date)
## [1] "Date"
colnames(sleep_log)
## [1] "Id"                 "date"               "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
class(sleep_log$date)
## [1] "Date"
colnames(minmets)
## [1] "Id"             "ActivityMinute" "METs"
class(minmets$ActivityMinute)
## [1] "POSIXct" "POSIXt"
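To illustrate why the date formats used above work, here is a small sketch on two example strings laid out like SleepDay and ActivityMinute. The time zone tz = "UTC" is an assumption made only so the result does not depend on the local machine, and the AM/PM marker is assumed to be read in an English locale.

```r
# Example strings in the same layout as SleepDay and ActivityMinute
sleep_day <- "4/12/2016 12:00:00 AM"
activity_minute <- "4/12/2016 1:30:00 PM"

# as.Date() stops reading once the format is consumed, so the time part is dropped
d <- as.Date(sleep_day, format = "%m/%d/%Y")

# %I is the hour on a 12-hour clock and %p the AM/PM marker
# tz = "UTC" is an assumption for reproducibility
ts <- as.POSIXct(activity_minute, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

format(d)             # "2016-04-12"
format(ts, "%H:%M")   # "13:30"
```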

With the dates corrected and duplicates removed, we proceed to merge the daily_activities and sleep_log datasets for analysis.

#merge
daily_activities_sleep <- merge(daily_activities, sleep_log)

#check
str(daily_activities_sleep)
## 'data.frame':    410 obs. of  18 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ date                    : Date, format: "2016-04-12" "2016-04-13" ...
##  $ TotalSteps              : num  13162 10735 9762 12669 9705 ...
##  $ TotalDistance           : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ TrackerDistance         : num  8.5 6.97 6.28 8.16 6.48 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.14 2.71 3.19 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 1.26 0.41 0.78 ...
##  $ LightActiveDistance     : num  6.06 4.71 2.83 5.04 2.51 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : num  25 21 29 36 38 50 28 19 41 39 ...
##  $ FairlyActiveMinutes     : num  13 19 34 10 20 31 12 8 21 5 ...
##  $ LightlyActiveMinutes    : num  328 217 209 221 164 264 205 211 262 238 ...
##  $ SedentaryMinutes        : num  728 776 726 773 539 775 818 838 732 709 ...
##  $ Calories                : num  1985 1797 1745 1863 1728 ...
##  $ TotalSleepRecords       : num  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep      : num  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed          : num  346 407 442 367 712 320 377 364 384 449 ...

As the sleep_log dataset only covers 24 participants, the combined dataset from daily_activities and sleep_log can have at most 24 participants instead of 33.
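The effect of the merge on the number of participants can be seen on a toy example. This is a minimal sketch with hypothetical data, assuming (as in the merge above) that rows are matched on the shared Id and date columns.

```r
# Hypothetical activity log: three users, two days each
activity <- data.frame(
  Id = c(1, 1, 2, 2, 3, 3),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12",
                   "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalSteps = c(9000, 12000, 4000, 5000, 15000, 11000)
)

# Hypothetical sleep log: user 3 has no sleep records, user 2 has one
sleep <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13")),
  TotalMinutesAsleep = c(400, 380, 450)
)

# merge() keeps only Id/date pairs present in both tables (an inner join)
both <- merge(activity, sleep, by = c("Id", "date"))

nrow(both)                # 3
length(unique(both$Id))   # 2
```

Days and users without a matching sleep record drop out, which is why the merged dataset has fewer rows than either input.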

These will be the datasets for our analysis:

  1. daily_activities_sleep (combination of ‘dailyActivity_merged.csv’ & ‘sleepDay_merged.csv’)
  2. minmets (from ‘minuteMETsNarrow_merged.csv’)

Analyze & Share Phase

We will now begin to analyze the selected dataset to see what we can learn from them.

Based on the data that we have, we will explore these 2 areas:

  1. Activity level (using: ‘daily_activities_sleep’ & ‘minmets’)
  2. Sleep behaviour (using: ‘daily_activities_sleep’)

1. User’s activity level

Calories

Based on Fitbit’s FAQ, their devices also record a user’s basal metabolic rate (BMR) - the rate at which you burn calories at rest to maintain vital body functions (including breathing, blood circulation, and heartbeat), even when the user is not actively doing anything. A user’s BMR is based on the physical data they entered into their Fitbit account (height, weight, sex, and age) and accounts for at least half the calories they burn in a day.

According to this article, an average person’s BMR is approximately 1800 calories per day. Hence we will deduct 1800 calories from the recorded calories burned to estimate the daily average calories a user burns on top of their BMR.

Steps

Walking 10,000 steps per day is the recommended guideline for healthy adults to achieve health benefits.

Next, we will look at users’ average steps and calories burned per day.

daily_activities_sleep %>% 
  group_by(Id) %>% 
  summarise(avgsteps = mean(TotalSteps), avgcal = mean(Calories), addcalburn = avgcal - 1800) %>% 
  select(avgsteps, avgcal, addcalburn) %>% 
  summary()
##     avgsteps         avgcal       addcalburn    
##  Min.   : 1490   Min.   :1541   Min.   :-259.2  
##  1st Qu.: 5106   1st Qu.:1947   1st Qu.: 147.4  
##  Median : 8336   Median :2248   Median : 448.2  
##  Mean   : 7880   Mean   :2397   Mean   : 597.3  
##  3rd Qu.: 9394   3rd Qu.:2988   3rd Qu.:1188.2  
##  Max.   :19079   Max.   :3539   Max.   :1739.2
daily_activities_sleep %>% 
  transform(Id = as.character(Id)) %>% 
  group_by(Id) %>% 
  summarise(avgsteps = mean(TotalSteps), avgcal = mean(Calories)) %>% 
  ggplot() +
  geom_col(aes(Id, avgsteps, fill = 'grey')) +
  geom_col(aes(Id, avgcal, fill = 'blue')) +
  geom_hline(yintercept = 10000, colour = 'black') + #10000 is the steps per day
  geom_hline(yintercept = 1800, colour = 'red', size = 0.8) + #1800 is the BMR calories
  scale_y_continuous(breaks = c(0, 1800, 5000, 10000, 15000, 20000)) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab('participant ID') +
  ylab('avg steps & calories burn per day') +
  scale_fill_manual(values = c('blue', 'grey'), name = 'Legend', labels = c('Calories', 'Steps'))

From the statistical summary above, users clocked an average of 7880 steps per day, which is fewer than the recommended guideline of 10000 steps per day. However, they are burning an additional 597.3 calories on average per day on top of their BMR.

Next, the graph above breaks this down into each user’s average daily steps and calories. It shows that only 5 users managed to hit the recommended 10000 steps per day. Among the rest, some still managed to burn as many calories as those who did, for example ID 4020332650 vs 4388161847 and 8053475328 vs 8378563200.

This suggests that clocking more steps does not necessarily equate to burning more calories. Users may be doing other forms of activity that do not register as steps, such as swimming.

Sedentary vs Active behaviour

A sedentary behaviour involves long periods of sitting and/or lying down, with very little to no exercise. Quoting from World Health Organisation (WHO) ‘Sedentary lifestyles increase all causes of mortality, double the risk of cardiovascular diseases, diabetes, and obesity, and increase the risks of colon cancer, high blood pressure, osteoporosis, lipid disorders, depression and anxiety…Among the preventive measures recommended by WHO are moderate physical activity for up to 30 minutes every day’

We will take a look at users’ sedentary vs active behaviours.

daily_activities_sleep %>% 
  transform(Id = as.character(Id)) %>% 
  group_by(Id) %>% 
  summarise(avgsed = mean(SedentaryMinutes), avgactive = mean(VeryActiveMinutes+FairlyActiveMinutes)) %>% 
  ggplot() +
  geom_col(aes(Id, avgsed, fill = 'grey')) +
  geom_col(aes(Id, avgactive, fill = 'blue')) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab('participant ID') +
  ylab('avg Sedentary & Active mins per day') +
  scale_fill_manual(values = c('blue', 'grey'), name = 'Legend', labels = c('Active minutes', 'Sedentary Minutes'))

We can see a stark difference between users’ sedentary and active periods. However, this may also be heavily influenced by a user’s occupation. For example, if we were to compare a labourer’s data to a white collar worker’s, it is reasonable to assume the labourer would show a higher ratio of active to sedentary minutes per day.

Next we will zoom in on users’ active period.

daily_activities_sleep %>% 
  transform(Id = as.character(Id)) %>% 
  group_by(Id) %>% 
  summarise(avgactive = mean(VeryActiveMinutes+FairlyActiveMinutes)) %>% 
  ggplot() +
  geom_col(aes(Id, avgactive)) +
  geom_hline(yintercept = 30, colour = 'red') +
  scale_y_continuous(breaks = c(0, 30, 50, 100)) +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab('participant ID') +
  ylab('avg active min per day')

Based on WHO’s recommendation of moderate physical activity of up to 30 minutes every day, we can see a roughly even split between users who meet this guideline and users who do not.
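The split can also be expressed as a number instead of being read off the chart. The sketch below uses hypothetical per-user averages, not values from the Fitbit data, to show the calculation.

```r
# Hypothetical per-user averages of VeryActiveMinutes + FairlyActiveMinutes
avgactive <- c(u1 = 12, u2 = 45, u3 = 30, u4 = 8, u5 = 62, u6 = 27)

# Minutes per day, the threshold drawn as the red line in the chart
guideline <- 30

# TRUE for users at or above the guideline
meets <- avgactive >= guideline
table(meets)

# Proportion of users at or above the guideline
share_meeting <- mean(meets)
share_meeting
```

Applied to the summarised avgactive column, the same two lines would give the share of participants meeting the guideline.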

Metabolic Equivalent of Task (MET)

MET is a measure of how much energy a person expends based on the amount of oxygen consumed by the body. Energy expenditure may differ from person to person based on several factors, including your age and fitness level. For example, a young athlete who exercises daily won’t need to expend the same amount of energy during a brisk walk as an older, sedentary person… The American Heart Association recommends at least 150 minutes of moderate-intensity aerobic exercise each week for optimal cardiovascular health. That’s equal to about 500 MET minutes per week, according to the Department of Health and Human Services.

According to Fitbit’s data dictionary, the recorded MET values in the datasets are by default multiplied by 10, hence we will divide these numbers by 10 to obtain the actual MET values.
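To make the scaling and the idea of MET minutes concrete, here is a minimal sketch on hypothetical minute-level values. It assumes that only minutes at 3 MET or more count as moderate-or-vigorous activity; that cut-off is a common convention and is not stated in the dataset itself.

```r
# Hypothetical raw minute-level values as stored in the dataset (MET x 10)
raw_mets <- c(10, 10, 12, 35, 40, 62, 65, 30, 11, 10)

# Actual MET value for each minute
met <- raw_mets / 10

# Assumption: minutes at 3 MET or more count as moderate-or-vigorous activity
active <- met >= 3

# Each value covers one minute, so MET minutes is the sum of MET over active minutes
met_minutes <- sum(met[active])
met_minutes
```

For scale, 150 minutes at roughly 3.3 MET comes to about 500 MET minutes, which matches the weekly figure quoted above.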

The charts below show each user’s MET values by time of day, with all recorded days overlaid. We will be splitting the chart into 2 parts for easier viewing.

minmets %>% 
  mutate(date = date(ActivityMinute)) %>%
  #decimal hour of day, e.g. 13:30 becomes 13.5 (HHMM/100 would give 13.3)
  mutate(time = hour(ActivityMinute) + minute(ActivityMinute)/60) %>%
  relocate(METs, .after = time) %>% 
  filter(row_number() <= 625860) %>% 
  ggplot() +
  geom_line(aes(time, METs/10)) +
  ylab('MET') +
  xlab('Time') +
  ggtitle('Part 1') +
  geom_hline(yintercept = 6, colour = 'blue') +
  scale_x_continuous(breaks = c(0, 3, 6, 9, 12, 15, 18, 21)) +
  scale_y_continuous(breaks = c(0, 6, 10, 15)) +
  facet_wrap(~Id)

minmets %>% 
  mutate(date = date(ActivityMinute)) %>%
  #decimal hour of day, e.g. 13:30 becomes 13.5 (HHMM/100 would give 13.3)
  mutate(time = hour(ActivityMinute) + minute(ActivityMinute)/60) %>%
  relocate(METs, .after = time) %>% 
  filter(row_number() >= 625861) %>% 
  ggplot() +
  geom_line(aes(time, METs/10)) +
  ylab('MET') +
  xlab('Time') +
  ggtitle('Part 2') +
  geom_hline(yintercept = 6, colour = 'blue') +
  scale_x_continuous(breaks = c(0, 3, 6, 9, 12, 15, 18, 21)) +
  scale_y_continuous(breaks = c(0, 6, 10, 15)) +
  facet_wrap(~Id)

According to this article, vigorous activities have a MET value of approximately 6 or more, based on an average person’s weight of 70 kg. Higher MET values indicate higher energy expenditure.

The two charts above represent the energy expenditure of each user throughout the day. We can observe a routine from the pattern in each chart.

For example, data from user ID 8253242879 (Part 2, col 3 row 3) spikes between 9 am and 12 pm, with values exceeding 10 MET. It is reasonable to assume that this may be the user’s workout time each day.

MET data from user ID 1503960366 (Part 1, col 1 row 1) is consistently >6 throughout the day, suggesting that this user may be engaged in a labour-intensive job.

However, if we look at an individual user’s energy expenditure on a day-by-day basis, we can observe a different trend. We will focus on user ID 1503960366 for illustration.

graph1 <- minmets %>% 
  mutate(date = date(ActivityMinute)) %>%
  #decimal hour of day, e.g. 13:30 becomes 13.5 (HHMM/100 would give 13.3)
  mutate(time = hour(ActivityMinute) + minute(ActivityMinute)/60) %>%
  relocate(METs, .after = time) %>% 
  filter(Id == 1503960366) %>%
  ggplot() +
  geom_line(aes(time, METs/10)) +
  ylab('METs') +
  xlab('Time') +
  ggtitle('MET over days') +
  scale_x_continuous(breaks = c(0, 3, 6, 9, 12, 15, 18, 21)) +
  scale_y_continuous(breaks = c(0, 1, 6, 10, 15)) +
  geom_hline(yintercept = 6, colour = 'blue') +
  transition_time(date) +
  labs(subtitle = "Date: {frame_time}")

animate(graph1, height = 500, width = 1000, fps = 30, duration = 20, end_pause = 60, res = 100)

In our initial analysis above, we suggested that user ID 1503960366 may be engaged in a labour-intensive job for long hours, as the user’s MET data were consistently >6 throughout the day. However, the day-by-day visualisation suggests otherwise. Energy expenditure is >6 only during certain periods of the day, and this pattern varies from day to day. This goes against the initial suggestion that user ID 1503960366 may be engaged in a labour-intensive job.

The line chart that we initially plotted, which shows every user’s data, stacks all the different days into a single chart, creating the illusion that user ID 1503960366 has a high energy expenditure throughout the day. Looking at it day by day shows that user ID 1503960366 expends vigorous energy only at certain periods, at different times each day.
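The stacking effect described above can be reproduced with a toy example. This sketch uses hypothetical hourly MET values for one user over three days, each day containing a single two-hour burst at a different time; none of the numbers come from the Fitbit data.

```r
# Hypothetical hourly MET values: at rest (1 MET) except for one two-hour burst per day
hours <- 0:23
day1 <- ifelse(hours %in% c(7, 8), 8, 1)
day2 <- ifelse(hours %in% c(12, 13), 8, 1)
day3 <- ifelse(hours %in% c(18, 19), 8, 1)
days <- rbind(day1, day2, day3)

# Per day: share of hours spent at 6 MET or more
share_per_day <- rowMeans(days >= 6)

# Stacked view: an hour looks vigorous if any day reached 6 MET at that hour
stacked_peak <- apply(days, 2, max)
share_stacked <- mean(stacked_peak >= 6)

share_per_day   # 2 of 24 hours on each day
share_stacked   # 6 of 24 hours once the days are overlaid
```

With a month of data instead of three days, the overlaid chart can appear vigorous for most of the day even though each individual day is mostly at rest.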

2. Users’ Sleep behaviour

Sedentary vs Sleep hours

According to the Centers for Disease Control and Prevention (CDC), adults are recommended to get 7 or more hours of sleep per night.

We will explore further whether sedentary behaviours affect a user’s sleep duration.

daily_activities_sleep %>% 
  ggplot(aes(SedentaryMinutes, TotalMinutesAsleep)) +
  geom_point() +
  geom_smooth() +
  geom_hline(yintercept = 420, colour = 'red') + #420mins = 7hrs sleep
  scale_y_continuous(breaks = c(0, 200, 400, 420, 600, 800)) +
  xlab('Total Sedentary Minutes') +
  ylab('Total Minutes Asleep') +
  stat_cor(method = "pearson")

We see a negative correlation (r = -0.6) between users’ sedentary minutes and their total minutes asleep. This suggests that more sedentary time is associated with less sleep, although a correlation on its own does not show that one causes the other.
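For readers who want the coefficient outside the plot, the sketch below computes a Pearson correlation and its p-value with base R’s cor.test(). The two vectors are hypothetical daily records, not values taken from the Fitbit data.

```r
# Hypothetical daily records (not taken from the Fitbit data)
sedentary_min <- c(600, 650, 700, 750, 800, 850, 900, 950)
asleep_min    <- c(480, 455, 470, 430, 420, 400, 410, 360)

# Pearson correlation together with its significance test
res <- cor.test(sedentary_min, asleep_min, method = "pearson")

r <- unname(res$estimate)   # correlation coefficient
p <- res$p.value            # p-value for the null hypothesis of zero correlation
round(r, 2)
```

Running the same call on SedentaryMinutes and TotalMinutesAsleep from daily_activities_sleep should give the coefficient shown on the chart.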

Average time to fall asleep

To omit data from users who took naps, we filtered for TotalSleepRecords == 1, so as to focus only on days where users slept once. Time to fall asleep is estimated as TotalTimeInBed minus TotalMinutesAsleep. We also removed 2 users from the analysis as outliers, as they took an average of ~300 mins and ~150 mins to fall asleep.

daily_activities_sleep %>% 
  transform(Id = as.character(Id)) %>% 
  mutate(timetosleep = TotalTimeInBed - TotalMinutesAsleep) %>% 
  filter(TotalSleepRecords == 1, Id != '1844505072', Id != '3977333714') %>% 
  select(timetosleep) %>% 
  summary()
##   timetosleep    
##  Min.   :  0.00  
##  1st Qu.: 16.00  
##  Median : 23.00  
##  Mean   : 26.66  
##  3rd Qu.: 34.00  
##  Max.   :165.00
daily_activities_sleep %>% 
  transform(Id = as.character(Id)) %>% 
  mutate(timetosleep = TotalTimeInBed - TotalMinutesAsleep) %>% 
  filter(TotalSleepRecords == 1, Id != '1844505072', Id != '3977333714') %>% 
  ggplot(aes(Id, timetosleep)) +
  geom_boxplot() +
  geom_hline(yintercept = 26.6, colour = 'red') +
  scale_y_continuous(breaks = c(0, 26.6, 50, 100, 150)) +
  xlab('participant ID') +
  ylab('time to sleep') +
  theme(axis.text.x = element_text(angle = 90))

On average, users took around 26.6 minutes to fall asleep, which is 6 mins more than the average time for people to fall asleep. According to this article, this suggests possible mild insomnia among the users.

Conclusion

During our analysis of the two datasets, we noted that some users have recorded fewer days of data than the rest, as shown below.

idlist <- unique(minmets$Id)

xminmets <- minmets %>% 
  mutate(date = date(ActivityMinute))

for (i in seq_along(idlist)) {
  dates <- xminmets %>% 
    filter(Id == idlist[i]) %>% 
    distinct(date)
  
  print(paste('No.',i, 'ID', idlist[i], ', days of data recorded:', nrow(dates)))
}
## [1] "No. 1 ID 1503960366 , days of data recorded: 30"
## [1] "No. 2 ID 1624580081 , days of data recorded: 31"
## [1] "No. 3 ID 1644430081 , days of data recorded: 30"
## [1] "No. 4 ID 1844505072 , days of data recorded: 31"
## [1] "No. 5 ID 1927972279 , days of data recorded: 31"
## [1] "No. 6 ID 2022484408 , days of data recorded: 31"
## [1] "No. 7 ID 2026352035 , days of data recorded: 31"
## [1] "No. 8 ID 2320127002 , days of data recorded: 31"
## [1] "No. 9 ID 2347167796 , days of data recorded: 18"
## [1] "No. 10 ID 2873212765 , days of data recorded: 31"
## [1] "No. 11 ID 3372868164 , days of data recorded: 20"
## [1] "No. 12 ID 3977333714 , days of data recorded: 29"
## [1] "No. 13 ID 4020332650 , days of data recorded: 31"
## [1] "No. 14 ID 4057192912 , days of data recorded: 4"
## [1] "No. 15 ID 4319703577 , days of data recorded: 31"
## [1] "No. 16 ID 4388161847 , days of data recorded: 31"
## [1] "No. 17 ID 4445114986 , days of data recorded: 31"
## [1] "No. 18 ID 4558609924 , days of data recorded: 31"
## [1] "No. 19 ID 4702921684 , days of data recorded: 31"
## [1] "No. 20 ID 5553957443 , days of data recorded: 31"
## [1] "No. 21 ID 5577150313 , days of data recorded: 30"
## [1] "No. 22 ID 6117666160 , days of data recorded: 28"
## [1] "No. 23 ID 6290855005 , days of data recorded: 28"
## [1] "No. 24 ID 6775888955 , days of data recorded: 26"
## [1] "No. 25 ID 6962181067 , days of data recorded: 31"
## [1] "No. 26 ID 7007744171 , days of data recorded: 26"
## [1] "No. 27 ID 7086361926 , days of data recorded: 31"
## [1] "No. 28 ID 8053475328 , days of data recorded: 31"
## [1] "No. 29 ID 8253242879 , days of data recorded: 18"
## [1] "No. 30 ID 8378563200 , days of data recorded: 31"
## [1] "No. 31 ID 8583815059 , days of data recorded: 30"
## [1] "No. 32 ID 8792009665 , days of data recorded: 28"
## [1] "No. 33 ID 8877689391 , days of data recorded: 31"
idlist2 <- unique(daily_activities_sleep$Id)

for (i in seq_along(idlist2)) {
  dates <- daily_activities_sleep %>% 
    filter(Id == idlist2[i]) %>% 
    distinct(date)
  
  print(paste('No.',i, 'ID', idlist2[i], ', days of data recorded:', nrow(dates)))
}
## [1] "No. 1 ID 1503960366 , days of data recorded: 25"
## [1] "No. 2 ID 1644430081 , days of data recorded: 4"
## [1] "No. 3 ID 1844505072 , days of data recorded: 3"
## [1] "No. 4 ID 1927972279 , days of data recorded: 5"
## [1] "No. 5 ID 2026352035 , days of data recorded: 28"
## [1] "No. 6 ID 2320127002 , days of data recorded: 1"
## [1] "No. 7 ID 2347167796 , days of data recorded: 15"
## [1] "No. 8 ID 3977333714 , days of data recorded: 28"
## [1] "No. 9 ID 4020332650 , days of data recorded: 8"
## [1] "No. 10 ID 4319703577 , days of data recorded: 26"
## [1] "No. 11 ID 4388161847 , days of data recorded: 23"
## [1] "No. 12 ID 4445114986 , days of data recorded: 28"
## [1] "No. 13 ID 4558609924 , days of data recorded: 5"
## [1] "No. 14 ID 4702921684 , days of data recorded: 27"
## [1] "No. 15 ID 5553957443 , days of data recorded: 31"
## [1] "No. 16 ID 5577150313 , days of data recorded: 26"
## [1] "No. 17 ID 6117666160 , days of data recorded: 18"
## [1] "No. 18 ID 6775888955 , days of data recorded: 3"
## [1] "No. 19 ID 6962181067 , days of data recorded: 31"
## [1] "No. 20 ID 7007744171 , days of data recorded: 2"
## [1] "No. 21 ID 7086361926 , days of data recorded: 24"
## [1] "No. 22 ID 8053475328 , days of data recorded: 3"
## [1] "No. 23 ID 8378563200 , days of data recorded: 31"
## [1] "No. 24 ID 8792009665 , days of data recorded: 15"
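The per-user day counts above can also be obtained without an explicit loop. The following is a minimal sketch in base R on hypothetical records shaped like minmets (an Id and a timestamp); the data frame demo and its values are illustrative only.

```r
# Hypothetical minute-level records: Id plus timestamp
demo <- data.frame(
  Id = c(1, 1, 1, 2, 2, 3),
  ActivityMinute = as.POSIXct(
    c("2016-04-12 00:00:00", "2016-04-12 00:01:00", "2016-04-13 00:00:00",
      "2016-04-12 08:00:00", "2016-04-12 08:01:00", "2016-04-14 09:00:00"),
    tz = "UTC"
  )
)

# Calendar day of each record
demo$date <- as.Date(demo$ActivityMinute)

# Number of distinct calendar days per user
days_recorded <- tapply(demo$date, demo$Id, function(d) length(unique(d)))
days_recorded
```

The result is a named vector with one entry per Id, which can be sorted to surface the users with the fewest recorded days.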

Given these findings, and the risks highlighted above of sampling bias and of the sampled units not being representative of the larger target population of interest, we recommend that Bellabeat either use their own data or perform their own survey for a more accurate analysis.

Nevertheless, we can still draw some insights about the users from the analysis performed. We recommend that the following features be added to Bellabeat’s app:

  1. Send a daily reminder to the user’s device encouraging at least 30 minutes of moderate-intensity activity per day, as per WHO’s recommendation, along with an hourly prompt to stand up and move around for 3 minutes to reduce sedentary behaviour.

  2. Send a daily reminder to the user’s device 30 minutes before their pre-set bedtime asking them to get ready for sleep. The reminder can also provide a list of dos and don’ts to facilitate sleep, e.g. stop using electronic devices, as the blue light emitted affects the body’s ability to prepare for sleep.